Results: Donor Importance
- Show tabs for Years 1, 3, 5 (and 90 days?).
- Save data so it can be loaded in model_data and abnormal_data.
- Calibration plots are essentially the same as what I'm calling unexplained effects.
- Rerun and save the data files (as csv or rds).
- Causal Forests.
- Add back in PTR_SEQUENCE_NUMBER! This is useful to show it doesn't influence outcomes!
- Outliers in TBILI_CAND_TX. Truncate at 40?
1 MODEL FIXES
1.1 functions.R
- Add `...` to `model_refit()` for chdy models since they have no tuning parameters.
- Add `importance = "impurity"` to rf models in `model_refit()`.
2 data split
Performance dropped when I used the TX_YEAR nested CV. This is probably because the training data was smaller: originally, I held out 200 observations, but under this strategy the holdout data was up to 450 or so.
3 TODO:
- Literature review: all existing models that use donor variables.
- in peds
- also in adults

Argument: the donor variables may not influence survival enough to consider.
3.1 Notes
This uses the classification framing of Miller et al. (Pediatric Transplantation, 2019). But we may want to compare with the SRTR model (https://www.srtr.org/tools/posttransplant-outcomes/), which is based on a Cox PH model (classification vs. survival).
Use Repeated/Nested CV to reduce variability in the:
- tuning parameters
- variable importance scores
- test performance
- calibration/diagnostic assessment
Because we are removing censored observations, the actual survival rate is higher than our estimates suggest; our estimates are biased. We do not adjust for this because our goal is evaluating donor variables, not survival modeling per se.
3.2 Hypothesis
The primary hypothesis is: given a normal echo, ischemic time < 6 hrs, and donor age < 30 yrs, no other donor variables are related to outcomes (under current acceptance practice). The original hypothesis was: donor characteristics aren't related to outcome. We switched to make the paper clearer and more direct.
3.3 Approach
Remove transplants with abnormal echos.
Create nested cross-validation.
- The outer split was on TX_YEAR, so all test performance estimates are based on one year's worth of transplants.
- The inner split was 10-fold cross-validation using status1 (one-year survival) as the stratifying variable.
- Because of censoring, the train/test counts (and proportions) differ somewhat.
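A sketch of how this split could be constructed with rsample (the data frame `model_data` and the use of `group_vfold_cv()` here are illustrative; `TX_YEAR` and `status1` are the column names used in this analysis):

```r
library(rsample)

# Outer resamples: hold out one transplant year at a time (grouped by TX_YEAR);
# inner resamples: 10-fold CV stratified on one-year survival (status1).
folds <- nested_cv(
  model_data,
  outside = group_vfold_cv(group = TX_YEAR),
  inside  = vfold_cv(v = 10, strata = status1)
)
```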
Fit three models (RF, BT, LR) using several predictor sets (e.g., all, candidate only, donor only, candidate_comparison_ischemic) on the training data. We used 10-fold cross-validation to select the tuning (hyper) parameters. This provides: a) a CV estimate of predictive performance, b) the optimal tuning parameters, and c) a final model that refits on all the training data with the optimal tuning parameter(s). This last step is used to obtain feature importance scores and a prediction function for computing test performance.
- Fit a final model using the optimal tuning parameter for each metric (ROC, log-loss, and brier score).
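With tidymodels, selecting and refitting per metric might look like the sketch below (`tune_res`, `wf`, and `train_data` are placeholders for the tuning results, workflow, and training split; the metric names match those used in this report):

```r
library(tune)
library(workflows)

metrics <- c("roc_auc", "mn_log_loss", "brier_class")

# One finalized fit per metric, each using that metric's optimal tuning parameters
final_fits <- lapply(metrics, function(m) {
  best <- select_best(tune_res, metric = m)
  fit(finalize_workflow(wf, best), data = train_data)
})
names(final_fits) <- metrics
```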
Assess donor importance by:
- Comparing CV predictive performance across predictor sets (for all three models).
- Repeating the comparison on the test data.
- Exploring feature importance scores for models fit with all predictors and for the best models.
- Predictive diagnostics: checking for unexplained patterns in predictive residuals, using the candidate-only (or best simple) model.
4 Overall Survival
NOTE: the 5-year survival for 2018 and 2019 is badly skewed! Don't use it for anything!!! We need to think about whether to include these years in the cross-validation.
NOTE: 2014 is an outlier. Survival drops below 90%.
Table of transplants and survival. Note: n_tx is the total number of transplants with normal echos, but the other values are computed after removing unknown/censored outcomes.


Training and Test data sizes:
5 Predictive Performance
| model | name |
|---|---|
| lr | Logistic Regression (lasso penalty) |
| bt | Boosted Trees (xgboost) |
| bt1 | Boosted Stumps (GAM) |
| rf | Random Forest |
| chdy | Choudhry’s model (ECMO, Ventilator, AGE<1, eGFR, TBILI) |
This shows the predictive performance by metric, using the averages from predicting each of the 10 years with models developed from the other 9 years.
5.1 Nested CV Results
Log Loss
Brier Score
AUC
Number of times model/var had best performance:
Log Loss
Brier Score
AUC
Number of times model/var had best performance:
NOTE: the outcome data is censored at 5 years, so it is probably not good to report this.
Log Loss
Brier Score
AUC
Number of times model/var had best performance:
5.2 Modeling Predictive Performance
This implements a mixed-effects (hierarchical) Bayesian model to capture which elements drive predictive performance. The model is:
performance ~ intercept + model_effects + var_effects + (1|resample)
where the intercept corresponds to baseline performance with model = logistic regression and variable set = candidate only.
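A sketch of fitting this model with rstanarm (the data frame `perf_long` and its column names are placeholders; `performance` holds one metric value per model/variable-set/resample combination):

```r
library(rstanarm)

# Mixed-effects Bayesian model: fixed effects for model and variable set,
# random intercept per resample.
fit <- stan_lmer(
  performance ~ model + var + (1 | resample),
  data = perf_long,
  chains = 4, iter = 2000, seed = 1
)

# Posterior intervals for the effects, and a 95% prediction interval
# for performance on a new year's worth of transplants:
posterior_interval(fit, prob = 0.95)
predictive_interval(fit, prob = 0.95)
```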


Model Output:

Predicted performance on another year’s worth of transplants. Using the baseline logistic regression model with candidate only variables. The expected Log Loss is 0.237 with 95% Bayesian prediction interval of [0.181, 0.296].

Model Output:

Predicted performance on another year’s worth of transplants. Using the baseline logistic regression model with candidate only variables. The expected Brier Score is 0.063 with 95% Bayesian prediction interval of [0.045, 0.081].

Model Output:

Predicted performance on another year’s worth of transplants. Using the baseline logistic regression model with candidate only variables. The expected AUROC is 0.731 with 95% Bayesian prediction interval of [0.632, 0.831].

6 Predictive Diagnostics
Get predictions from optimal logistic regression model using candidate only variables and 1-year survival as the outcome of interest.
Use the predicted probability of survival as an offset in modeling the hold-out data. We then model the hold-out data for each predictor variable and check for significant effects which would indicate evidence that the variable should be included in the model.
- This is especially important for checking for unaccounted donor, offer, or donor-candidate effects.
\[ \log\left( \frac{p}{1-p} \right) = \log\left( \frac{\hat{p}}{1-\hat{p}} \right) + s(x) \]

Predictions:
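A minimal sketch of this offset model with mgcv, assuming a hold-out data frame `holdout` (hypothetical name) with a 0/1 outcome `status1`, the model's predicted survival probability `p_hat`, and one predictor to check (here `ISCHTIME`):

```r
library(mgcv)

# Model predictions on the logit scale, used as a fixed offset
holdout$offset_logit <- qlogis(holdout$p_hat)

calib <- gam(
  status1 ~ s(ISCHTIME) + offset(offset_logit),
  family = binomial, data = holdout
)
summary(calib)  # a significant s(ISCHTIME) term suggests an unexplained effect
```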
6.1 Calibration of predicted probabilities
The 1-year survival models using logistic regression with candidate-only variables have good calibration. Note that 90% of transplants have a predicted one-year survival over 0.85.


6.2 Unexplained effects
Calibration by predictor variable (unexplained effects)
Run through all TX_YEARs and all variables. Fit a calibration model with the model predictions (on the logit scale) as an offset, and check for any statistically significant patterns.
Combine results from all splits before modeling:
6.3 Partial Effects
The previous analysis suggests that the candidate only predictions explain the outcomes pretty well. That is, there are no residual patterns in the donor, donor-candidate, or offer variables that would offer significant improvements in predictions.
However, we know that some variables, like ischemic time and donor-candidate size compatibility, should be related to outcomes. What is potentially happening is that the candidate variables are capturing the effects of donor and offer variables through their associations.
Per split:
# A tibble: 408 × 5
var metric min less_05 less_01
<chr> <chr> <dbl> <int> <int>
1 AGE_CAND brier_class 0 10 10
2 AGE_CAND mn_log_loss 0 10 10
3 AGE_CAND roc_auc 0 10 10
4 AGE_DON brier_class 0 10 10
5 AGE_DON mn_log_loss 0 10 10
6 AGE_DON roc_auc 0 10 10
7 BSA_CAND brier_class 0 10 10
8 BSA_CAND mn_log_loss 0 10 10
9 BSA_CAND roc_auc 0 10 10
10 BSA_DON brier_class 0 10 10
11 BSA_DON mn_log_loss 0 10 10
12 BSA_DON roc_auc 0 10 10
13 CAND_DIAG brier_class 0 10 10
14 CAND_DIAG mn_log_loss 0 10 10
15 CAND_DIAG roc_auc 0 10 10
16 CAND_DIAG_CODE brier_class 0 10 10
17 CAND_DIAG_CODE mn_log_loss 0 10 10
18 CAND_DIAG_CODE roc_auc 0 10 10
19 CAND_DIAG_full brier_class 0 10 10
20 CAND_DIAG_full mn_log_loss 0 10 10
21 CAND_DIAG_full roc_auc 0 10 10
22 CAND_DIAG_primary brier_class 0 10 10
23 CAND_DIAG_primary mn_log_loss 0 10 10
24 CAND_DIAG_primary roc_auc 0 10 10
25 FUNC_STAT_CAND_REG brier_class 0 10 10
26 FUNC_STAT_CAND_REG mn_log_loss 0 10 10
27 FUNC_STAT_CAND_REG roc_auc 0 10 10
28 FUNC_STAT_CAND_TX brier_class 0 10 10
29 FUNC_STAT_CAND_TX mn_log_loss 0 10 10
30 FUNC_STAT_CAND_TX roc_auc 0 10 10
31 HEIGHT_CAND_CM brier_class 0 10 10
32 HEIGHT_CAND_CM mn_log_loss 0 10 10
33 HEIGHT_CAND_CM roc_auc 0 10 10
34 HEIGHT_DON_CM brier_class 0 10 10
35 HEIGHT_DON_CM mn_log_loss 0 10 10
36 HEIGHT_DON_CM roc_auc 0 10 10
37 INFECTION_CAND brier_class 0 10 10
38 INFECTION_CAND mn_log_loss 0 10 10
39 INFECTION_CAND roc_auc 0 10 10
40 TRANSFUSIONS_CAND brier_class 0 10 10
# ℹ 368 more rows
# A tibble: 136 × 5
var metric min less_05 less_01
<chr> <chr> <dbl> <int> <int>
1 AGE_CAND mn_log_loss 0 10 10
2 AGE_DON mn_log_loss 0 10 10
3 BSA_CAND mn_log_loss 0 10 10
4 BSA_DON mn_log_loss 0 10 10
5 CAND_DIAG mn_log_loss 0 10 10
6 CAND_DIAG_CODE mn_log_loss 0 10 10
7 CAND_DIAG_full mn_log_loss 0 10 10
8 CAND_DIAG_primary mn_log_loss 0 10 10
9 FUNC_STAT_CAND_REG mn_log_loss 0 10 10
10 FUNC_STAT_CAND_TX mn_log_loss 0 10 10
11 HEIGHT_CAND_CM mn_log_loss 0 10 10
12 HEIGHT_DON_CM mn_log_loss 0 10 10
13 INFECTION_CAND mn_log_loss 0 10 10
14 TRANSFUSIONS_CAND mn_log_loss 0 10 10
15 VENTILATOR_CAND_TX mn_log_loss 0 10 10
16 VENT_SUPPORT_TX mn_log_loss 0 10 10
17 WEIGHT_CAND_KG mn_log_loss 0 10 10
18 WEIGHT_DON_KG mn_log_loss 0 10 10
19 BMI_DON mn_log_loss 0.00000157 10 10
20 ECMO_CAND_TX mn_log_loss 0 9 9
21 TBILI_CAND_TX mn_log_loss 0 9 9
22 MED_COND_CAND_TX mn_log_loss 0.000000862 10 8
23 PO2_DON mn_log_loss 0.00000352 8 8
24 VENTILATOR_CAND_REG mn_log_loss 0 9 7
25 ALBUM_CAND_REG mn_log_loss 0.00000434 8 7
26 eGFR_CAND_TX mn_log_loss 0 7 7
27 DEATH_CIRCUM_DON mn_log_loss 0.0000605 9 6
28 ECMO_CAND_REG mn_log_loss 0 8 6
29 POSTERIOR_WALL mn_log_loss 0.000000449 8 6
30 DIAL_CAND mn_log_loss 0 7 6
31 HEIGHT_RATIO mn_log_loss 0.00000471 7 6
32 SEPTAL_WALL mn_log_loss 0.00000505 7 6
33 TRANSFUS_TERM_DON mn_log_loss 0.000269 9 5
34 ISCHTIME mn_log_loss 0 7 5
35 VAD_CAND_TX mn_log_loss 0.00000522 7 5
36 RACE_CAND mn_log_loss 0.000136 7 5
37 ABO_STR mn_log_loss 0 6 5
38 CREAT_CAND_TX mn_log_loss 0.000296 6 5
39 CPRA mn_log_loss 0 5 5
40 CPRA_PEAK mn_log_loss 0 5 5
# ℹ 96 more rows
Combined over all splits
6.3.1 Plots of Partial Effects
This is like a SHAP value: it shows how the individual variables relate to predictions. TODO: add color to points (survived vs. not), using two calls to geom_point() so the not-survived (minority class) points show on top.
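The two-layer idea from the TODO might look like this (the data frame `preds` and its column names are hypothetical; the smoother matches the `y ~ s(x, bs = "cs")` formula used in these plots):

```r
library(ggplot2)

ggplot(mapping = aes(x = ISCHTIME, y = partial_effect)) +
  geom_point(data = subset(preds, status1 == "survived"), colour = "grey70") +
  # second layer draws last, so the minority class shows on top
  geom_point(data = subset(preds, status1 == "died"), colour = "firebrick") +
  geom_smooth(data = preds, method = "gam", formula = y ~ s(x, bs = "cs"))
```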

7 Variable Importance
7.0.1 Using “all variables”:
$bt
# A tibble: 18 × 7
year model Variable total brier_class mn_log_loss roc_auc
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 bt BMI_DIFF 30 10 10 10
2 1 bt CAND_DIAG_Congenital.Heart… 30 10 10 10
3 1 bt eGFR_CAND_TX 30 10 10 10
4 1 bt TBILI_CAND_TX 30 10 10 10
5 1 bt CPRA 29 10 10 9
6 1 bt ECMO_CAND_TX_Yes 17 7 4 6
7 1 bt INFECTION_CAND_Yes 10 2 4 4
8 1 bt PCO2_DON 6 2 2 2
9 1 bt LISTING_CTR_PREV_SURVIVAL_… 5 2 2 1
10 1 bt BMI_CAND 4 1 2 1
11 1 bt VENTILATOR_CAND_TX_Yes 4 0 2 2
12 1 bt BUN_DON 3 0 1 2
13 1 bt HEIGHT_CAND_CM 3 0 2 1
14 1 bt ISCHTIME 3 2 1 0
15 1 bt DA1_DON 2 2 0 0
16 1 bt PO2_DON 2 0 0 2
17 1 bt CPR_DURATION_DON 1 1 0 0
18 1 bt HEMATOCRIT_DON 1 1 0 0
$bt1
# A tibble: 15 × 7
year model Variable total brier_class mn_log_loss roc_auc
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 bt1 BMI_DIFF 30 10 10 10
2 1 bt1 CAND_DIAG_Congenital.Heart… 30 10 10 10
3 1 bt1 CPRA 30 10 10 10
4 1 bt1 eGFR_CAND_TX 30 10 10 10
5 1 bt1 TBILI_CAND_TX 30 10 10 10
6 1 bt1 INFECTION_CAND_Yes 16 5 7 4
7 1 bt1 ECMO_CAND_TX_Yes 13 5 4 4
8 1 bt1 HEIGHT_CAND_CM 10 4 1 5
9 1 bt1 VENTILATOR_CAND_TX_Yes 8 2 3 3
10 1 bt1 BMI_DON 5 2 1 2
11 1 bt1 WEIGHT_DON_KG 3 1 2 0
12 1 bt1 TRANSFUSIONS_CAND_Yes 2 0 1 1
13 1 bt1 CPR_DURATION_DON 1 0 0 1
14 1 bt1 ISCHTIME 1 0 1 0
15 1 bt1 POSTERIOR_WALL 1 1 0 0
$lr
# A tibble: 12 × 7
year model Variable total brier_class mn_log_loss roc_auc
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 lr BMI_DIFF_poly_1 30 10 10 10
2 1 lr CAND_DIAG_Congenital.Heart… 30 10 10 10
3 1 lr CPRA 30 10 10 10
4 1 lr TBILI_CAND_TX 30 10 10 10
5 1 lr INFECTION_CAND_Yes 25 8 8 9
6 1 lr ECMO_CAND_TX_Yes 22 7 7 8
7 1 lr VENTILATOR_CAND_TX_Yes 13 4 4 5
8 1 lr DEATH_CIRCUM_DON_Natural.C… 12 5 5 2
9 1 lr TRANSFUSIONS_CAND_Yes 9 3 3 3
10 1 lr RACE_CAND_Hispanic 5 2 2 1
11 1 lr TX_DOW_Monday 3 1 1 1
12 1 lr HEIGHT_CAND_CM 1 0 0 1
$rf
# A tibble: 13 × 7
year model Variable total brier_class mn_log_loss roc_auc
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 rf BMI_CAND 30 10 10 10
2 1 rf BMI_DIFF 30 10 10 10
3 1 rf eGFR_CAND_TX 30 10 10 10
4 1 rf ISCHTIME 30 10 10 10
5 1 rf TBILI_CAND_TX 30 10 10 10
6 1 rf LISTING_CTR_PREV_SURVIVAL_… 21 7 7 7
7 1 rf HRS_DECEASED_AT_CLAMP 10 2 4 4
8 1 rf HEMATOCRIT_DON 7 1 3 3
9 1 rf WEIGHT_CAND_KG 7 3 2 2
10 1 rf DAYS_ACTIVE 6 2 2 2
11 1 rf DAYS_STAT1A 4 2 1 1
12 1 rf WEIGHT_RATIO 4 2 1 1
13 1 rf BSA_CAND 1 1 0 0
7.0.2 Using “candidate only” variables:
$bt
# A tibble: 14 × 7
year model Variable total brier_class mn_log_loss roc_auc
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 bt BMI_DIFF 30 10 10 10
2 1 bt CAND_DIAG_Congenital.Heart… 30 10 10 10
3 1 bt CPRA 30 10 10 10
4 1 bt eGFR_CAND_TX 30 10 10 10
5 1 bt TBILI_CAND_TX 30 10 10 10
6 1 bt HEIGHT_CAND_CM 19 6 7 6
7 1 bt ISCHTIME 16 5 6 5
8 1 bt ECMO_CAND_TX_Yes 7 2 2 3
9 1 bt INFECTION_CAND_Yes 6 1 2 3
10 1 bt BMI_CAND 4 2 1 1
11 1 bt DAYS_STAT1A 3 2 1 0
12 1 bt ALBUM_CAND_REG 2 1 1 0
13 1 bt HLAMIS 2 1 0 1
14 1 bt VENTILATOR_CAND_TX_Yes 1 0 0 1
$bt1
# A tibble: 12 × 7
year model Variable total brier_class mn_log_loss roc_auc
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 bt1 BMI_DIFF 30 10 10 10
2 1 bt1 CAND_DIAG_Congenital.Heart… 30 10 10 10
3 1 bt1 CPRA 30 10 10 10
4 1 bt1 eGFR_CAND_TX 30 10 10 10
5 1 bt1 TBILI_CAND_TX 30 10 10 10
6 1 bt1 HEIGHT_CAND_CM 23 7 9 7
7 1 bt1 INFECTION_CAND_Yes 14 5 4 5
8 1 bt1 ISCHTIME 12 4 4 4
9 1 bt1 ECMO_CAND_TX_Yes 5 1 2 2
10 1 bt1 VENTILATOR_CAND_TX_Yes 3 1 0 2
11 1 bt1 ALBUM_CAND_REG 2 1 1 0
12 1 bt1 TRANSFUSIONS_CAND_Yes 1 1 0 0
$lr
# A tibble: 14 × 7
year model Variable total brier_class mn_log_loss roc_auc
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 lr CAND_DIAG_Congenital.Heart… 30 10 10 10
2 1 lr CPRA 30 10 10 10
3 1 lr TBILI_CAND_TX 28 8 10 10
4 1 lr BMI_DIFF_poly_1 25 7 8 10
5 1 lr HEIGHT_CAND_CM 19 7 7 5
6 1 lr INFECTION_CAND_Yes 19 4 7 8
7 1 lr ECMO_CAND_TX_Yes 16 2 5 9
8 1 lr RACE_CAND_Hispanic 12 6 4 2
9 1 lr eGFR_CAND_TX 8 6 2 0
10 1 lr eGFR_CAND_TX_90 8 6 2 0
11 1 lr VENTILATOR_CAND_TX_Yes 7 1 2 4
12 1 lr TRANSFUSIONS_CAND_Yes 6 2 2 2
13 1 lr ABO_CAND_AB 1 0 1 0
14 1 lr CMV_CAND_Positive 1 1 0 0
$rf
# A tibble: 10 × 7
year model Variable total brier_class mn_log_loss roc_auc
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 rf BMI_CAND 30 10 10 10
2 1 rf BMI_DIFF 30 10 10 10
3 1 rf BSA_CAND 30 10 10 10
4 1 rf eGFR_CAND_TX 30 10 10 10
5 1 rf ISCHTIME 30 10 10 10
6 1 rf TBILI_CAND_TX 27 9 9 9
7 1 rf WEIGHT_CAND_KG 24 8 8 8
8 1 rf DAYS_ACTIVE 3 1 1 1
9 1 rf DAYSWAIT_CHRON 3 1 1 1
10 1 rf HEIGHT_CAND_CM 3 1 1 1
8 Difference between model predictions
This only shows differences within the same model when the tuning parameters are chosen by different metrics. Good: predictions are pretty robust to tuning parameter selection.

8.1 Differences across models

9 Causal Forests

